Parallel Tree Reduction on MapReduce
نویسندگان
چکیده
MapReduce, the de facto standard for large scale data-intensive applications, is a remarkable parallel programming model, allowing for easy parallelization of data intensive computations over many machines in a cloud. As huge tree data such as XML has achieved the status of the de facto standard for representing structured information, the situation calls for efficient MapReduce programs treating such a tree data structure in parallel. However, development of such MapReduce programs has remained a challenge. In this paper, restructuring our previous BSP algorithm for tree reduction computations, we propose a new MapReduce algorithm that can be used to implement various tree computations such as XPath queries. Our algorithm is designed to achieve linear speedup even for extreme inputs, and our experimental result shows that our prototype implementation actually achieves linear speedup even for monadic trees.
منابع مشابه
PDTSSE: A Scalable Parallel Decision Tree Algorithm Based on MapReduce
Parallel decision tree learning is an effective and efficient approach to scaling the decision tree to large data mining application. Aiming at large scale decision tree learning, we present a novel parallel decision tree learning algorithm in MapReduce framework, called PDTSSE (Parallel Decision Tree via Sampling Splitting points with Estimation). We first propose an estimation method for samp...
متن کاملA Tree Algorithm Based on Parallel Cloud Computing Model
Cloud computing is the development of parallel computing, distributed computing and grid computing, and with the advancement of cloud computing, how to design efficient distributed tree algorithm is receiving more and more attention, Constrained by parallel assumption, Parallel tree algorithm are not easy to express in MapReduce. Inspired by Bulk Synchronous Parallel model, we propose an enhanc...
متن کاملPLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google’s...
متن کاملA New Parallelization Method for K-means
K-means is a popular clustering method used in data mining area. To work with large datasets, researchers propose PKMeans, which is a parallel k-means on MapReduce [3]. However, the existing k-means parallelization methods including PKMeans have many limitations. It can’t finish all its iterations in one MapReduce job, so it has to repeat cascading MapReduce jobs in a loop until convergence. On...
متن کاملCloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012